Skip to content

Conversation

@marevol
Copy link
Contributor

@marevol marevol commented Nov 16, 2025

This pull request introduces several improvements to thread safety and error handling in the data store and index update callback implementations. The most significant changes involve switching to thread-safe collections and variables, adding synchronization and volatile keywords for shared state, and improving robustness when loading plugin files.

Thread safety improvements:

  • Changed dataStoreMap in DataStoreFactory to use ConcurrentHashMap for safe concurrent access.
  • Made dataStoreNames and lastLoadedTime in DataStoreFactory volatile to ensure visibility across threads, and synchronized the getDataStoreNames method. [1] [2]
  • Updated the alive flag in AbstractDataStore to be volatile for safe multi-threaded access.
  • Changed deleteUrlList in FileListIndexUpdateCallbackImpl to use CopyOnWriteArrayList for thread-safe batch processing.

Error handling and robustness:

  • Improved loadDataStoreNameList in DataStoreFactory to handle null plugin JAR files and missing configuration files gracefully, with debug logging for missing files.

Dependency and import fixes:

  • Corrected the import for StringUtil in FileListIndexUpdateCallbackImpl to use the correct package.
  • Added missing import for CopyOnWriteArrayList in FileListIndexUpdateCallbackImpl.

Code organization:

  • Moved the default constructor in AbstractDataStore to a more logical location in the file. [1] [2]

@marevol marevol requested a review from Copilot November 16, 2025 11:25
@marevol marevol self-assigned this Nov 16, 2025
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This pull request enhances thread safety and error handling in the data store implementation. The changes focus on ensuring safe concurrent access to shared state and improving robustness when dealing with plugin files.

Key changes:

  • Introduced thread-safe collections (ConcurrentHashMap, CopyOnWriteArrayList) and volatile variables to protect shared state across threads
  • Added null checks and graceful error handling for missing plugin files and configuration files
  • Fixed import statement to use the correct StringUtil package

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File Description
FileListIndexUpdateCallbackImpl.java Changed deleteUrlList to CopyOnWriteArrayList for thread safety and corrected StringUtil import
DataStoreFactory.java Converted dataStoreMap to ConcurrentHashMap, made dataStoreNames and lastLoadedTime volatile, synchronized getDataStoreNames method, and added null/existence checks for plugin files
AbstractDataStore.java Made alive flag volatile for thread visibility and relocated the default constructor

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 93 to 94
/** List of URLs to be deleted, cached for batch processing. Thread-safe using CopyOnWriteArrayList. */
protected List<String> deleteUrlList = new CopyOnWriteArrayList<>();
Copy link

Copilot AI Nov 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using CopyOnWriteArrayList for deleteUrlList may cause performance issues if URLs are frequently added. This collection creates a copy of the underlying array on every write operation, which could be expensive for batch processing scenarios. Consider using a synchronized ArrayList or Collections.synchronizedList() if the list is primarily accessed within synchronized blocks during batch operations.

Suggested change
/** List of URLs to be deleted, cached for batch processing. Thread-safe using CopyOnWriteArrayList. */
protected List<String> deleteUrlList = new CopyOnWriteArrayList<>();
/** List of URLs to be deleted, cached for batch processing. Thread-safe using Collections.synchronizedList. */
protected List<String> deleteUrlList = Collections.synchronizedList(new ArrayList<>());

Copilot uses AI. Check for mistakes.
* Thread-safe implementation using ConcurrentHashMap for concurrent access.
*/
protected Map<String, DataStore> dataStoreMap = new LinkedHashMap<>();
protected Map<String, DataStore> dataStoreMap = new ConcurrentHashMap<>();
Copy link

Copilot AI Nov 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Switching from LinkedHashMap to ConcurrentHashMap removes insertion-order guarantees. If any code depends on the iteration order of data stores, this change could introduce subtle bugs. Verify that insertion order is not required for this map's usage.

Copilot uses AI. Check for mistakes.
@marevol marevol changed the title Improve datastore implementation Improve data store handling with thread safety and resource management enhancements Nov 20, 2025
@marevol marevol added this to the 15.4.0 milestone Nov 20, 2025
…t enhancements

This commit addresses several improvements in the data store processing implementation:

**Thread Safety Improvements:**
- DataStoreFactory: Changed dataStoreMap from LinkedHashMap to ConcurrentHashMap for thread-safe concurrent access
- DataStoreFactory: Made dataStoreNames and lastLoadedTime volatile to ensure visibility across threads
- DataStoreFactory: Made getDataStoreNames() synchronized to prevent race conditions during cache refresh
- AbstractDataStore: Made alive field volatile to ensure proper visibility in multi-threaded scenarios
- FileListIndexUpdateCallbackImpl: Changed deleteUrlList from ArrayList to CopyOnWriteArrayList for thread-safe access from executor threads

**Resource Handling:**
- DataStoreFactory: Added null check for jarFiles array to prevent NullPointerException
- DataStoreFactory: Added existence check for XML configuration files before attempting to read them, reducing unnecessary exception handling

**Code Quality:**
- AbstractDataStore: Moved static logger declaration before instance fields following Java best practices
- FileListIndexUpdateCallbackImpl: Fixed incorrect StringUtil import (was using Apache POI's StringUtil instead of Fess's)

These improvements enhance the reliability and thread safety of the data store processing layer,
particularly important for concurrent crawling operations.
Changed deleteUrlList from CopyOnWriteArrayList to ArrayList since all access
is already protected by synchronized(indexUpdateCallback) blocks. CopyOnWriteArrayList
creates a copy on every write operation which is unnecessary overhead when synchronization
is already in place.

Changes:
- Replaced CopyOnWriteArrayList with ArrayList for deleteUrlList
- Added synchronized block in commit() method for consistency
- Updated javadoc to clarify synchronization strategy

This improves performance for batch delete operations by eliminating unnecessary
array copying on each URL addition.
Fixed two failing tests that were using 'echo' commands which complete
almost instantly, making it impossible to verify process running state:

1. test_startProcess_replaceExistingProcess:
   - Changed from 'echo' to 'sleep 10' commands
   - Added proper wait time before checking process state
   - Added verification that only one process exists for the session

2. test_destroyProcess_withRunningProcess:
   - Changed from 'echo' to 'sleep 10' command
   - Simplified polling logic - just wait 100ms for process to start
   - Updated exit code assertion (forcibly destroyed processes may have non-zero exit codes)

These changes ensure the tests can reliably verify that processes are running
before attempting to check their state or destroy them.
Added extensive test coverage for the thread safety improvements made to
the data store handling layer.

DataStoreFactoryTest additions:
- test_add_concurrentAccess: Verifies ConcurrentHashMap handles concurrent additions
- test_getDataStore_concurrentWithAdd: Tests concurrent read/write operations
- test_getDataStoreNames_concurrentAccess: Verifies synchronized method correctness
- test_volatileFields_visibility: Ensures volatile fields are visible across threads
- test_cacheRefresh_withConcurrentReads: Validates cache refresh with concurrent access
- test_nullSafety_concurrentAccess: Tests null handling under concurrent load

AbstractDataStoreTest additions:
- test_aliveField_volatileVisibility: Verifies volatile alive field visibility
- test_stop_volatileVisibility: Tests stop() method sets alive and is visible
- test_stop_concurrentAccess: Validates concurrent calls to stop()
- test_aliveField_concurrentReadWrite: Tests concurrent read/write of alive field
- test_getName_concurrentAccess: Verifies getName() is thread-safe

FileListIndexUpdateCallbackImplTest additions:
- test_deleteUrlList_synchronizedAccess: Validates ArrayList with synchronized blocks
- test_deleteUrlList_concurrentReads: Tests concurrent reads from deleteUrlList
- test_deleteUrlList_clearOperation: Verifies clear() operation thread safety
- test_deleteUrlList_iteration: Tests iteration with proper synchronization
- test_deleteUrlList_isEmptyCheck: Validates isEmpty() check under concurrent access

These tests verify:
1. ConcurrentHashMap correctly handles concurrent operations in DataStoreFactory
2. Volatile fields ensure proper visibility across threads
3. Synchronized methods prevent race conditions during cache refresh
4. ArrayList works correctly when all access is properly synchronized
5. No ConcurrentModificationException occurs in any concurrent scenario

All tests use minimal or no sleep to maintain fast test execution time.
Based on feedback that dataStoreMap is not accessed by multiple threads,
reverted the ConcurrentHashMap change back to LinkedHashMap.

Changes:
- Reverted dataStoreMap from ConcurrentHashMap to LinkedHashMap
- Removed ConcurrentHashMap import
- Removed dataStoreMap-related concurrency tests:
  * test_add_concurrentAccess
  * test_getDataStore_concurrentWithAdd
  * test_nullSafety_concurrentAccess

Kept volatile fields and synchronized getDataStoreNames():
- lastLoadedTime and dataStoreNames remain volatile (for cache visibility)
- getDataStoreNames() remains synchronized (for cache refresh safety)
- Related tests for getDataStoreNames() concurrency are maintained

The dataStoreMap is accessed in single-threaded context during application
initialization and data store registration.
@marevol marevol force-pushed the claude/improve-datastore-handling-01HotyVCm32VMRRd3hCT6bD3 branch from 3e6abba to ffe107b Compare November 20, 2025 13:06
@marevol marevol merged commit 083092a into master Nov 21, 2025
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants